At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

metric used = auc(since data is not balanced)

Note: Here i am working only on train data dropping the labels for test as aaic team used to work for better visualization

importing the necessary libraries

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns            #For plots
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    
%matplotlib inline
In [0]:
train = pd.read_csv("/content/train.csv.zip")
test = pd.read_csv("/content/test.csv.zip")
In [0]:
train.shape, test.shape
Out[0]:
((200000, 202), (200000, 201))

As we can see from above that train and test, both contains 0.2 million rows.

In [0]:
train.head(5)
Out[0]:
ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 ... var_160 var_161 var_162 var_163 var_164 var_165 var_166 var_167 var_168 var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199
0 train_0 0 8.9255 -6.7863 11.9081 5.0930 11.4607 -9.2834 5.1187 18.6266 -4.9200 5.7470 2.9252 3.1821 14.0137 0.5745 8.7989 14.5691 5.7487 -7.2393 4.2840 30.7133 10.5350 16.2191 2.5791 2.4716 14.3831 13.4325 -5.1488 -0.4073 4.9306 5.9965 -0.3085 12.9041 -3.8766 16.8911 11.1920 10.5785 0.6764 7.8871 ... 15.4576 5.3133 3.6159 5.0384 6.6760 12.6644 2.7004 -0.6975 9.5981 5.4879 -4.7645 -8.4254 20.8773 3.1531 18.5618 7.7423 -10.1245 13.7241 -3.5189 1.7202 -8.4051 9.0164 3.0657 14.3691 25.8398 5.8764 11.8411 -19.7159 17.5743 0.5857 4.4354 3.9642 3.1364 1.6910 18.5227 -2.3978 7.8784 8.5635 12.7803 -1.0914
1 train_1 0 11.5006 -4.1473 13.8588 5.3890 12.3622 7.0433 5.6208 16.5338 3.1468 8.0851 -0.4032 8.0585 14.0239 8.4135 5.4345 13.7003 13.8275 -15.5849 7.8000 28.5708 3.4287 2.7407 8.5524 3.3716 6.9779 13.8910 -11.7684 -2.5586 5.0464 0.5481 -9.2987 7.8755 1.2859 19.3710 11.3702 0.7399 2.7995 5.8434 ... 29.4846 5.8683 3.8208 15.8348 -5.0121 15.1345 3.2003 9.3192 3.8821 5.7999 5.5378 5.0988 22.0330 5.5134 30.2645 10.4968 -7.2352 16.5721 -7.3477 11.0752 -5.5937 9.4878 -14.9100 9.4245 22.5441 -4.8622 7.6543 -15.9319 13.3175 -0.3566 7.6421 7.7214 2.5837 10.9516 15.4305 2.0339 8.1267 8.7889 18.3560 1.9518
2 train_2 0 8.6093 -2.7457 12.0805 7.8928 10.5825 -9.0837 6.9427 14.6155 -4.9193 5.9525 -0.3249 -11.2648 14.1929 7.3124 7.5244 14.6472 7.6782 -1.7395 4.7011 20.4775 17.7559 18.1377 1.2145 3.5137 5.6777 13.2177 -7.9940 -2.9029 5.8463 6.1439 -11.1025 12.4858 -2.2871 19.0422 11.0449 4.1087 4.6974 6.9346 ... 13.2070 5.8442 4.7086 5.7141 -1.0410 20.5092 3.2790 -5.5952 7.3176 5.7690 -7.0927 -3.9116 7.2569 -5.8234 25.6820 10.9202 -0.3104 8.8438 -9.7009 2.4013 -4.2935 9.3908 -13.2648 3.1545 23.0866 -5.3000 5.3745 -6.2660 10.1934 -0.8417 2.9057 9.7905 1.6704 1.6858 21.6042 3.1417 -6.5213 8.2675 14.7222 0.3965
3 train_3 0 11.0604 -2.1518 8.9522 7.1957 12.5846 -1.8361 5.8428 14.9250 -5.8609 8.2450 2.3061 2.8102 13.8463 11.9704 6.4569 14.8372 10.7430 -0.4299 15.9426 13.7257 20.3010 12.5579 6.8202 2.7229 12.1354 13.7367 0.8135 -0.9059 5.9070 2.8407 -15.2398 10.4407 -2.5731 6.1796 10.6093 -5.9158 8.1723 2.8521 ... 31.8833 5.9684 7.2084 3.8899 -11.0882 17.2502 2.5881 -2.7018 0.5641 5.3430 -7.1541 -6.1920 18.2366 11.7134 14.7483 8.1013 11.8771 13.9552 -10.4701 5.6961 -3.7546 8.4117 1.8986 7.2601 -0.4639 -0.0498 7.9336 -12.8279 12.4124 1.8489 4.4666 4.7433 0.7178 1.4214 23.0347 -1.2706 -2.9275 10.2922 17.9697 -8.9996
4 train_4 0 9.8369 -1.4834 12.8746 6.6375 12.2772 2.4486 5.9405 19.2514 6.2654 7.6784 -9.4458 -12.1419 13.8481 7.8895 7.7894 15.0553 8.4871 -3.0680 6.5263 11.3152 21.4246 18.9608 10.1102 2.7142 14.2080 13.5433 3.1736 -3.3423 5.9015 7.9352 -3.1582 9.4668 -0.0083 19.3239 12.4057 0.6329 2.7922 5.8184 ... 33.5107 5.6953 5.4663 18.2201 6.5769 21.2607 3.2304 -1.7759 3.1283 5.5518 1.4493 -2.6627 19.8056 2.3705 18.4685 16.3309 -3.3456 13.5261 1.7189 5.1743 -7.6938 9.7685 4.8910 12.2198 11.8503 -7.8931 6.4209 5.9270 16.0201 -0.2829 -1.4905 9.5214 -0.1508 9.1942 13.2876 -1.5121 3.9267 9.5031 17.9974 -8.8104

5 rows × 202 columns

In [0]:
test.head(5)
Out[0]:
ID_code var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 var_38 ... var_160 var_161 var_162 var_163 var_164 var_165 var_166 var_167 var_168 var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199
0 test_0 11.0656 7.7798 12.9536 9.4292 11.4327 -2.3805 5.8493 18.2675 2.1337 8.8100 -2.0248 -4.3554 13.9696 0.3458 7.5408 14.5001 7.7028 -19.0919 15.5806 16.1763 3.7088 18.8064 1.5899 3.0654 6.4509 14.1192 -9.4902 -2.1917 5.7107 3.7864 -1.7981 9.2645 2.0657 12.7753 11.3334 8.1462 -0.0610 3.5331 9.7804 ... 5.9232 5.4113 3.8302 5.7380 -8.6105 22.9530 2.5531 -0.2836 4.3416 5.1855 4.2603 1.6779 29.0849 8.4685 18.1317 12.2818 -0.6912 10.2226 -5.5579 2.2926 -4.5358 10.3903 -15.4937 3.9697 31.3521 -1.1651 9.2874 -23.5705 13.2643 1.6591 -2.1556 11.8495 -1.4300 2.4508 13.7112 2.4669 4.3654 10.7200 15.4722 -8.7197
1 test_1 8.5304 1.2543 11.3047 5.1858 9.1974 -4.0117 6.0196 18.6316 -4.4131 5.9739 -1.3809 -0.3310 14.1129 2.5667 5.4988 14.1853 7.0196 4.6564 29.1609 0.0910 12.1469 3.1389 5.2578 2.4228 16.2064 13.5023 -5.2341 -3.6648 5.7080 2.9965 -10.4720 11.4938 -0.9660 15.3445 10.6361 0.8966 6.7428 2.3421 12.8678 ... 30.9641 5.6723 3.6873 13.0429 -10.6572 15.5134 3.2185 9.0535 7.0535 5.3924 -0.7720 -8.1783 29.9227 -5.6274 10.5018 9.6083 -0.4935 8.1696 -4.3605 5.2110 0.4087 12.0030 -10.3812 5.8496 25.1958 -8.8468 11.8263 -8.7112 15.9072 0.9812 10.6165 8.8349 0.9403 10.1282 15.5765 0.4773 -1.4852 9.8714 19.1293 -20.9760
2 test_2 5.4827 -10.3581 10.1407 7.0479 10.2628 9.8052 4.8950 20.2537 1.5233 8.3442 -4.7057 -3.0422 13.6751 3.8183 10.8535 14.2126 9.8837 2.6541 21.2181 20.8163 12.4666 12.3696 4.7473 2.7936 5.2189 13.5670 -15.4246 -0.1655 7.2633 3.4310 -9.1508 9.7320 3.1062 22.3076 11.9593 9.9255 4.0702 4.9934 8.0667 ... 39.3654 5.5228 3.3159 4.3324 -0.5382 13.3009 3.1243 -4.1731 1.2330 6.1513 -0.0391 1.4950 16.8874 -2.9787 27.4035 15.8819 -10.9660 15.6415 -9.4056 4.4611 -3.0835 8.5549 -2.8517 13.4770 24.4721 -3.4824 4.9178 -2.0720 11.5390 1.1821 -0.7484 10.9935 1.9803 2.1800 12.9813 2.1281 -7.1086 7.0618 19.8956 -23.1794
3 test_3 8.5374 -1.3222 12.0220 6.5749 8.8458 3.1744 4.9397 20.5660 3.3755 7.4578 0.0095 -5.0659 14.0526 13.5010 8.7660 14.7352 10.0383 -15.3508 2.1273 21.4797 14.5372 12.5527 2.9707 4.2398 13.7796 14.1408 1.0061 -1.3479 5.2570 6.5911 6.2161 9.5540 2.3628 10.2124 10.8047 -2.5588 6.0720 3.2613 16.5632 ... 19.7251 5.3882 3.6775 7.4753 -11.0780 24.8712 2.6415 2.2673 7.2788 5.6406 7.2048 3.4504 2.4130 11.1674 14.5499 10.6151 -5.7922 13.9407 7.1078 1.1019 9.4590 9.8243 5.9917 5.1634 8.1154 3.6638 3.3102 -19.7819 13.4499 1.3104 9.5702 9.0766 1.6580 3.5813 15.1874 3.1656 3.9567 9.2295 13.0168 -4.2108
4 test_4 11.7058 -0.1327 14.1295 7.7506 9.1035 -8.5848 6.8595 10.6048 2.9890 7.1437 5.1025 -3.2827 14.1013 8.9672 4.7276 14.5811 11.8615 3.1480 18.0126 13.8006 1.6026 16.3059 6.7954 3.6015 13.6569 13.8807 8.6228 -2.2654 5.2255 7.0165 -15.6961 10.6239 -4.7674 17.5447 11.8668 3.0154 4.2546 6.7601 5.9613 ... 22.8700 5.6688 6.1159 13.2433 -11.9785 26.2040 3.2348 -5.5775 5.7036 6.1717 -1.6039 -2.4866 17.2728 2.3640 14.0037 12.9165 -12.0311 10.1161 -8.7562 6.0889 -1.3620 10.3559 -7.4915 9.4588 3.9829 5.8580 8.3635 -24.8254 11.4928 1.6321 4.2259 9.1723 1.2835 3.3778 19.5542 -0.2860 -5.1612 7.2882 13.9260 -9.1846

5 rows × 201 columns

In [0]:
#target = train["target"]
In [0]:
#train = train.drop(["target"], axis=1)

Checking missing values if any

In [0]:
train.isnull().sum()
Out[0]:
ID_code    0
target     0
var_0      0
var_1      0
var_2      0
var_3      0
var_4      0
var_5      0
var_6      0
var_7      0
var_8      0
var_9      0
var_10     0
var_11     0
var_12     0
var_13     0
var_14     0
var_15     0
var_16     0
var_17     0
var_18     0
var_19     0
var_20     0
var_21     0
var_22     0
var_23     0
var_24     0
var_25     0
var_26     0
var_27     0
          ..
var_170    0
var_171    0
var_172    0
var_173    0
var_174    0
var_175    0
var_176    0
var_177    0
var_178    0
var_179    0
var_180    0
var_181    0
var_182    0
var_183    0
var_184    0
var_185    0
var_186    0
var_187    0
var_188    0
var_189    0
var_190    0
var_191    0
var_192    0
var_193    0
var_194    0
var_195    0
var_196    0
var_197    0
var_198    0
var_199    0
Length: 202, dtype: int64

we can see that there is no null value in train

In [0]:
test.isnull().sum()
Out[0]:
ID_code    0
var_0      0
var_1      0
var_2      0
var_3      0
var_4      0
var_5      0
var_6      0
var_7      0
var_8      0
var_9      0
var_10     0
var_11     0
var_12     0
var_13     0
var_14     0
var_15     0
var_16     0
var_17     0
var_18     0
var_19     0
var_20     0
var_21     0
var_22     0
var_23     0
var_24     0
var_25     0
var_26     0
var_27     0
var_28     0
          ..
var_170    0
var_171    0
var_172    0
var_173    0
var_174    0
var_175    0
var_176    0
var_177    0
var_178    0
var_179    0
var_180    0
var_181    0
var_182    0
var_183    0
var_184    0
var_185    0
var_186    0
var_187    0
var_188    0
var_189    0
var_190    0
var_191    0
var_192    0
var_193    0
var_194    0
var_195    0
var_196    0
var_197    0
var_198    0
var_199    0
Length: 201, dtype: int64

We can see that there is no null values in test data

Lets describe train

In [0]:
train.describe()
Out[0]:
target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 var_38 ... var_160 var_161 var_162 var_163 var_164 var_165 var_166 var_167 var_168 var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199
count 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 ... 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000
mean 0.100490 10.679914 -1.627622 10.715192 6.796529 11.078333 -5.065317 5.408949 16.545850 0.284162 7.567236 0.394340 -3.245596 14.023978 8.530232 7.537606 14.573126 9.333264 -5.696731 15.244013 12.438567 13.290894 17.257883 4.305430 3.019540 10.584400 13.667496 -4.055133 -1.137908 5.532980 5.053874 -7.687740 10.393046 -0.512886 14.774147 11.434250 3.842499 2.187230 5.868899 10.642131 ... 24.259300 5.633293 5.362896 11.002170 -2.871906 19.315753 2.963335 -4.151155 4.937124 5.636008 -0.004962 -0.831777 19.817094 -0.677967 20.210677 11.640613 -2.799585 11.882933 -1.014064 2.591444 -2.741666 10.085518 0.719109 8.769088 12.756676 -3.983261 8.970274 -10.335043 15.377174 0.746072 3.234440 7.438408 1.927839 3.331774 17.993784 -0.142088 2.303335 8.908158 15.870720 -3.326537
std 0.300653 3.040051 4.050044 2.640894 2.043319 1.623150 7.863267 0.866607 3.418076 3.332634 1.235070 5.500793 5.970253 0.190059 4.639536 2.247908 0.411711 2.557421 6.712612 7.851370 7.996694 5.876254 8.196564 2.847958 0.526893 3.777245 0.285535 5.922210 1.523714 0.783367 2.615942 7.965198 2.159891 2.587830 4.322325 0.541614 5.179559 3.119978 2.249730 4.278903 ... 10.880263 0.217938 1.419612 5.262056 5.457784 5.024182 0.369684 7.798020 3.105986 0.369437 4.424621 5.378008 8.674171 5.966674 7.136427 2.892167 7.513939 2.628895 8.579810 2.798956 5.261243 1.371862 8.963434 4.474924 9.318280 4.725167 3.189759 11.574708 3.944604 0.976348 4.559922 3.023272 1.478423 3.992030 3.135162 1.429372 5.454369 0.921625 3.010945 10.438015
min 0.000000 0.408400 -15.043400 2.117100 -0.040200 5.074800 -32.562600 2.347300 5.349700 -10.505500 3.970500 -20.731300 -26.095000 13.434600 -6.011100 1.013300 13.076900 0.635100 -33.380200 -10.664200 -12.402500 -5.432200 -10.089000 -5.322500 1.209800 -0.678400 12.720000 -24.243100 -6.166800 2.089600 -4.787200 -34.798400 2.140600 -8.986100 1.508500 9.816900 -16.513600 -8.095100 -1.183400 -6.337100 ... -7.452200 4.852600 0.623100 -6.531700 -19.997700 3.816700 1.851200 -35.969500 -5.250200 4.258800 -14.506000 -22.479300 -11.453300 -22.748700 -2.995300 3.241500 -29.116500 4.952100 -29.273400 -7.856100 -22.037400 5.416500 -26.001100 -4.808200 -18.489700 -22.583300 -3.022300 -47.753600 4.412300 -2.554300 -14.093300 -2.691700 -3.814500 -11.783400 8.694400 -5.261000 -14.209600 5.960600 6.299300 -38.852800
25% 0.000000 8.453850 -4.740025 8.722475 5.254075 9.883175 -11.200350 4.767700 13.943800 -2.317800 6.618800 -3.594950 -7.510600 13.894000 5.072800 5.781875 14.262800 7.452275 -10.476225 9.177950 6.276475 8.627800 11.551000 2.182400 2.634100 7.613000 13.456400 -8.321725 -2.307900 4.992100 3.171700 -13.766175 8.870000 -2.500875 11.456300 11.032300 0.116975 -0.007125 4.125475 7.591050 ... 15.696125 5.470500 4.326100 7.029600 -7.094025 15.744550 2.699000 -9.643100 2.703200 5.374600 -3.258500 -4.720350 13.731775 -5.009525 15.064600 9.371600 -8.386500 9.808675 -7.395700 0.625575 -6.673900 9.084700 -6.064425 5.423100 5.663300 -7.360000 6.715200 -19.205125 12.501550 0.014900 -0.058825 5.157400 0.889775 0.584600 15.629800 -1.170700 -1.946925 8.252800 13.829700 -11.208475
50% 0.000000 10.524750 -1.608050 10.580000 6.825000 11.108250 -4.833150 5.385100 16.456800 0.393700 7.629600 0.487300 -3.286950 14.025500 8.604250 7.520300 14.574100 9.232050 -5.666350 15.196250 12.453900 13.196800 17.234250 4.275150 3.008650 10.380350 13.662500 -4.196900 -1.132100 5.534850 4.950200 -7.411750 10.365650 -0.497650 14.576000 11.435200 3.917750 2.198000 5.900650 10.562700 ... 23.864500 5.633500 5.359700 10.788700 -2.637800 19.270800 2.960200 -4.011600 4.761600 5.634300 0.002800 -0.807350 19.748000 -0.569750 20.206100 11.679800 -2.538450 11.737250 -0.942050 2.512300 -2.688800 10.036050 0.720200 8.600000 12.521000 -3.946950 8.902150 -10.209750 15.239450 0.742600 3.203600 7.347750 1.901300 3.396350 17.957950 -0.172700 2.408900 8.888200 15.934050 -2.819550
75% 0.000000 12.758200 1.358625 12.516700 8.324100 12.261125 0.924800 6.003000 19.102900 2.937900 8.584425 4.382925 0.852825 14.164200 12.274775 9.270425 14.874500 11.055900 -0.810775 21.013325 18.433300 17.879400 23.089050 6.293200 3.403800 13.479600 13.863700 -0.090200 0.015625 6.093700 6.798925 -1.443450 11.885000 1.469100 18.097125 11.844400 7.487725 4.460400 7.542400 13.598925 ... 32.622850 5.792000 6.371200 14.623900 1.323600 23.024025 3.241500 1.318725 7.020025 5.905400 3.096400 2.956800 25.907725 3.619900 25.641225 13.745500 2.704400 13.931300 5.338750 4.391125 0.996200 11.011300 7.499175 12.127425 19.456150 -0.590650 11.193800 -1.466000 18.345225 1.482900 6.406200 9.512525 2.949500 6.205800 20.396525 0.829600 6.556725 9.593300 18.064725 4.836800
max 1.000000 20.315000 10.376800 19.353000 13.188300 16.671400 17.251600 8.447700 27.691800 10.151300 11.150600 18.670200 17.188700 14.654500 22.331500 14.937700 15.863300 17.950600 19.025900 41.748000 35.183000 31.285900 49.044300 14.594500 4.875200 25.446000 14.654600 15.675100 3.243100 8.787400 13.143100 15.651500 20.171900 6.787100 29.546600 13.287800 21.528900 14.245600 11.863800 29.823500 ... 58.394200 6.309900 10.134400 27.564800 12.119300 38.332200 4.220400 21.276600 14.886100 7.089000 16.731900 17.917300 53.591900 18.855400 43.546800 20.854800 20.245200 20.596500 29.841300 13.448700 12.750500 14.393900 29.248700 23.704900 44.363400 12.997500 21.739200 22.786100 29.330300 4.034100 18.440900 16.716500 8.402400 18.281800 27.928800 4.272900 18.321500 12.000400 26.079100 28.500700

8 rows × 201 columns

Lets describe test

In [0]:
test.describe()
Out[0]:
var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 var_38 var_39 ... var_160 var_161 var_162 var_163 var_164 var_165 var_166 var_167 var_168 var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199
count 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.00000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 ... 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000 200000.000000
mean 10.658737 -1.624244 10.707452 6.788214 11.076399 -5.050558 5.415164 16.529143 0.277135 7.569407 0.371335 -3.268551 14.022662 8.540872 7.532703 14.573704 9.321669 -5.70445 15.265776 12.456675 13.298428 17.230598 4.299010 3.019707 10.567479 13.666970 -3.983721 -1.129536 5.530656 5.047247 -7.687695 10.404920 -0.524830 14.762686 11.434861 3.870130 2.213288 5.875048 10.647806 0.672667 ... 24.146181 5.635300 5.360975 11.026376 -2.857328 19.320760 2.962821 -4.189133 4.930356 5.633716 -0.020824 -0.805148 19.779528 -0.666240 20.264135 11.635715 -2.776134 11.864538 -0.949318 2.582604 -2.722636 10.080827 0.651432 8.768929 12.719302 -3.963045 8.978800 -10.291919 15.366094 0.755673 3.189766 7.458269 1.925944 3.322016 17.996967 -0.133657 2.290899 8.912428 15.869184 -3.246342
std 3.036716 4.040509 2.633888 2.052724 1.616456 7.869293 0.864686 3.424482 3.333375 1.231865 5.508661 5.961443 0.190071 4.628712 2.255257 0.411592 2.544860 6.74646 7.846983 7.989812 5.884245 8.199877 2.844023 0.527951 3.771047 0.285454 5.945853 1.524765 0.785618 2.610078 7.971581 2.156324 2.588700 4.325727 0.541040 5.170614 3.120685 2.257235 4.260820 4.078592 ... 10.876184 0.217936 1.426064 5.268894 5.457937 5.039303 0.370668 7.827428 3.086443 0.365750 4.417876 5.378492 8.678024 5.987419 7.141816 2.884821 7.557001 2.626556 8.570314 2.803890 5.225554 1.369546 8.961936 4.464461 9.316889 4.724641 3.206635 11.562352 3.929227 0.976123 4.551239 3.025189 1.479966 3.995599 3.140652 1.429678 5.446346 0.920904 3.008717 10.398589
min 0.188700 -15.043400 2.355200 -0.022400 5.484400 -27.767000 2.216400 5.713700 -9.956000 4.243300 -22.672400 -25.811800 13.424500 -4.741300 0.670300 13.203400 0.314300 -28.90690 -11.324200 -12.699400 -2.634600 -9.940600 -5.164000 1.390600 -0.731300 12.749600 -24.536100 -6.040900 2.842500 -4.421500 -34.054800 1.309200 -8.209000 1.691100 9.776400 -16.923800 -10.466800 -0.885100 -5.368300 -14.083700 ... -8.925700 4.910600 0.106200 -6.093700 -21.514000 3.667300 1.813100 -37.176400 -5.405700 4.291500 -15.593200 -20.393600 -11.796600 -21.342800 -2.485400 2.951200 -29.838400 5.025300 -29.118500 -7.767400 -20.610600 5.346000 -28.092800 -5.476800 -17.011400 -22.467000 -2.303800 -47.306400 4.429100 -2.511500 -14.093300 -2.407000 -3.340900 -11.413100 9.382800 -4.911900 -13.944200 6.169600 6.584000 -39.457800
25% 8.442975 -4.700125 8.735600 5.230500 9.891075 -11.201400 4.772600 13.933900 -2.303900 6.623800 -3.626000 -7.522000 13.891000 5.073375 5.769500 14.262400 7.454400 -10.49790 9.237700 6.322300 8.589600 11.511500 2.178300 2.633300 7.610750 13.456200 -8.265500 -2.299000 4.986275 3.166200 -13.781900 8.880600 -2.518200 11.440500 11.033200 0.162400 0.016900 4.120700 7.601375 -2.170300 ... 15.567800 5.473000 4.308175 7.067775 -7.051200 15.751000 2.696500 -9.712000 2.729800 5.375200 -3.250000 -4.678450 13.722200 -4.998900 15.126500 9.382575 -8.408100 9.793700 -7.337925 0.605200 -6.604425 9.081000 -6.154625 5.432025 5.631700 -7.334000 6.705300 -19.136225 12.492600 0.019400 -0.095000 5.166500 0.882975 0.587600 15.634775 -1.160700 -1.948600 8.260075 13.847275 -11.124000
50% 10.513800 -1.590500 10.560700 6.822350 11.099750 -4.834100 5.391600 16.422700 0.372000 7.632000 0.491850 -3.314950 14.024600 8.617400 7.496950 14.572700 9.228900 -5.69820 15.203200 12.484250 13.218650 17.211300 4.269000 3.008000 10.344300 13.661200 -4.125800 -1.127800 5.529900 4.953100 -7.409000 10.385350 -0.535200 14.561400 11.435100 3.947500 2.219250 5.909800 10.563750 0.700850 ... 23.734400 5.636600 5.359800 10.820600 -2.618500 19.290300 2.961100 -4.080550 4.749300 5.632800 0.008000 -0.782800 19.723750 -0.564750 20.287200 11.668400 -2.515400 11.707650 -0.868300 2.496500 -2.671600 10.027200 0.675100 8.602300 12.493350 -3.927300 8.912850 -10.166800 15.211000 0.759700 3.162400 7.379000 1.892600 3.428500 17.977600 -0.162000 2.403600 8.892800 15.943400 -2.725950
75% 12.739600 1.343400 12.495025 8.327600 12.253400 0.942575 6.005800 19.094550 2.930025 8.584825 4.362400 0.832525 14.162900 12.270900 9.271125 14.875600 11.035500 -0.81160 21.014500 18.441950 17.914200 23.031600 6.278200 3.405700 13.467500 13.862800 -0.000700 0.026200 6.092200 6.793425 -1.464000 11.890900 1.460225 18.084425 11.843100 7.513375 4.485600 7.556750 13.615200 3.654800 ... 32.495275 5.794100 6.367900 14.645800 1.330300 23.040250 3.241600 1.313125 7.004400 5.898900 3.070100 2.982900 25.849600 3.652625 25.720000 13.748500 2.737700 13.902500 5.423900 4.384725 1.024600 11.002000 7.474700 12.126700 19.437600 -0.626300 11.227100 -1.438800 18.322925 1.495400 6.336475 9.531100 2.956000 6.174200 20.391725 0.837900 6.519800 9.595900 18.045200 4.935400
max 22.323400 9.385100 18.714100 13.142000 16.037100 17.253700 8.302500 28.292800 9.665500 11.003600 20.214500 16.771300 14.682000 21.605100 14.723100 15.798000 17.368700 19.15090 38.929000 35.432300 32.075800 47.417900 14.042600 5.024600 23.839600 14.596400 13.456400 3.371300 8.459900 12.953200 14.391500 19.471900 6.949600 29.247500 13.225100 22.318300 13.094100 12.014900 27.142700 14.167300 ... 64.291100 6.343700 10.194200 27.150300 11.885500 37.026700 4.216200 20.524400 14.983600 6.936400 16.846500 17.269200 53.426500 19.237600 42.758200 19.892200 19.677300 20.007800 27.956800 14.067500 13.991000 14.055900 28.255300 25.568500 44.363400 12.488600 21.699900 23.569900 28.885200 3.780300 20.359000 16.716500 8.005000 17.632600 27.947800 4.545400 15.920700 12.275800 26.538400 27.907400

8 rows × 200 columns

from this train_vis we are going to visualize 11 features as pair plot because visualizing all at one takes a lot of time

In [0]:
train_vis = train.iloc[:, 1:13]
In [0]:
train_vis.head(2)
Out[0]:
target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10
0 0 8.9255 -6.7863 11.9081 5.093 11.4607 -9.2834 5.1187 18.6266 -4.9200 5.7470 2.9252
1 0 11.5006 -4.1473 13.8588 5.389 12.3622 7.0433 5.6208 16.5338 3.1468 8.0851 -0.4032

Data Analysis on train data

In [0]:
# PROVIDE CITATIONS TO YOUR CODE IF YOU TAKE IT FROM ANOTHER WEBSITE.
# https://matplotlib.org/gallery/pie_and_polar_charts/pie_and_donut_labels.html#sphx-glr-gallery-pie-and-polar-charts-pie-and-donut-labels-py


y_value_counts = train['target'].value_counts()
print("Number of people transacted the money in future ", y_value_counts[1], ", (", (y_value_counts[1]/(y_value_counts[1]+y_value_counts[0]))*100,"%)")
print("Number of people not transacted the money in future  ", y_value_counts[0], ", (", (y_value_counts[0]/(y_value_counts[1]+y_value_counts[0]))*100,"%)")
#above codes will give the%age of approved and not approved project

fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(aspect="equal"))
recipe = ["transacted", "not transacted"]

data = [y_value_counts[1], y_value_counts[0]]

wedges, texts = ax.pie(data, wedgeprops=dict(width=0.5), startangle=-40)

bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(xycoords='data', textcoords='data', arrowprops=dict(arrowstyle="-"),
          bbox=bbox_props, zorder=0, va="center")

for i, p in enumerate(wedges):
    ang = (p.theta2 - p.theta1)/2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    connectionstyle = "angle,angleA=0,angleB={}".format(ang)
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax.annotate(recipe[i], xy=(x, y), xytext=(1.35*np.sign(x), 1.4*y),
                 horizontalalignment=horizontalalignment, **kw)

ax.set_title("Number of people transacted money or not")

plt.show()
Number of people transacted the money in future  20098 , ( 10.049 %)
Number of people not transacted the money in future   179902 , ( 89.95100000000001 %)

So from the above plot we can observe that the number of people transected the money os about 10% of the total data only.\ this is a purely imbalanced data.

Visualising with all the features is quite difficult so I am choosing 10 var columns to visualise as pair plot

PAIR PLOT only for first 10 var_0 to var_10

we can visualize relationship between two varioables with this

In [0]:
#https://seaborn.pydata.org/generated/seaborn.pairplot.html
plt.close()  #Closing all open window
sns.set_style("whitegrid");
sns.pairplot(train_vis, hue="target", height=3);
plt.show()
/usr/local/lib/python3.6/dist-packages/statsmodels/nonparametric/kde.py:487: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
/usr/local/lib/python3.6/dist-packages/statsmodels/nonparametric/kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2

from this few features only we can see that both traget is easily seperable using any of the two features.\ although the data is imbalanced but easily seperable

In [0]:
def plot_feature_distribution(df1, df2, label1, label2, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(10,10,figsize=(18,22))

    for feature in features:
        i += 1
        plt.subplot(10,10,i)
        sns.distplot(df1[feature], hist=False,label=label1)
        sns.distplot(df2[feature], hist=False,label=label2)
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.show();

first 100

here I am distributing dataset label wise

In [0]:
t0 = train.loc[train['target'] == 0]
t1 = train.loc[train['target'] == 1]
In [0]:
features = train.columns.values[2:102]
plot_feature_distribution(t0, t1, '0', '1', features)
<Figure size 432x288 with 0 Axes>

from features 100 -200

In [0]:
features = train.columns.values[102:202]
plot_feature_distribution(t0, t1, '0', '1', features)
<Figure size 432x288 with 0 Axes>

As we can see from above pdf that there is a lot of different distribution\ and for most of the data where label=1 and label=0 follows same distribution\ var_10 ,var_11, var_8, var_65, var_84 ect. follows same distribution like gaussian\ var_70, var_60, var_85 ect follows similar distribution\ var_80, var_86 etc follows similar distribution.\ similarly we can see from feature 102 to 202.

Visualising by tsne

In [0]:
train_5000 = train.head(20000)
y =train_5000["target"]
x = train_5000.iloc[:,2:202].values
In [0]:
# https://github.com/pavlin-policar/fastTSNE you can try this also, this version is little faster than sklearn 
#reference: aaic tsne
import numpy as np
from sklearn.manifold import TSNE
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt


tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)

X_embedding = tsne.fit_transform(x)
# if x is a sparse matrix you need to pass it as X_embedding = tsne.fit_transform(x.toarray()) , .toarray() will convert the sparse matrix into dense matrix

for_tsne = np.vstack((X_embedding.T, y)).T#y.reshape(-1,1)
for_tsne_df = pd.DataFrame(data=for_tsne, columns=['Dim_1','Dim_2','label'])
# Ploting the result of tsne
sns.FacetGrid(for_tsne_df, hue="label", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.title("Visualise tsne ")
plt.show()

from the above tsne plot we can see that label 1 is not much seperable when we visualise it in 2d plot

Visualizing mean, median, std, kurtosis, skew, add, min, max, moving average of train and simultaneously doing feature engineering

In [0]:
features_train = train.columns.values[2:202]
features_test = test.columns.values[1:201]
row_mean_train = train[features_train].mean(axis=1)
train["row_mean"] =row_mean_train
row_mean_test = test[features_test].mean(axis=1)
test["row_mean"] = row_mean_test

Pdf gives the probabily of points lying in a certain range

In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_mean")\
             .add_legend()
plt.title("Histogram of mean")
plt.ylabel("Density of mean")
plt.plot()
Out[0]:
[]

from the above pdf we can say that when mean>6.2 and mean<7 then it is clear that probability of target=1 is high.

adding median

In [0]:
#reference : aaic haberman
In [0]:
row_median_train = train[features_train].median(axis=1)
train["row_median"] =row_median_train
row_median_test = test[features_test].median(axis=1)
test["row_median"] = row_median_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_median")\
             .add_legend()
plt.title("Histogram of median")
plt.ylabel("Density of median")
plt.plot()
Out[0]:
[]

from the above pdf we can see that when median>6 and median<7 , the probability of target==1 is high

std

In [0]:
row_std_train = train[features_train].std(axis=1)
train["row_std"] =row_std_train
row_std_test = test[features_test].std(axis=1)
test["row_std"] = row_std_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_std")\
             .add_legend()
plt.title("Histogram of std")
plt.ylabel("Density of std")
plt.plot()
Out[0]:
[]

it is clear from the above pdf that when std>9.2 and std<10.2 probability of target==1 is high.

min

In [0]:
row_min_train = train[features_train].min(axis=1)
train["row_min"] =row_min_train
row_min_test = test[features_test].min(axis=1)
test["row_min"] = row_min_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_min")\
             .add_legend()
plt.title("Histogram of min")
plt.ylabel("Density of min")
plt.plot()
Out[0]:
[]

it is clear from the above pdf that when min<-20 and min>-50 probability of target==1 is high

max

In [0]:
row_max_train = train[features_train].max(axis=1)
train["row_max"] =row_max_train
row_max_test = test[features_test].max(axis=1)
test["row_max"] = row_max_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_max")\
             .add_legend()
plt.title("Histogram of max")
plt.ylabel("Density of max")
plt.plot()
Out[0]:
[]

it is clear from the above pdf that when max>35 and min<45 probability of target==1 is high

Skew

In [0]:
row_skew_train = train[features_train].skew(axis=1)
train["row_skew"] =row_skew_train
row_skew_test = test[features_test].skew(axis=1)
test["row_skew"] = row_skew_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_skew")\
             .add_legend()
plt.title("Histogram of skew")
plt.ylabel("Density of skew")
plt.plot()
Out[0]:
[]

kurtosis

In [0]:
row_kurt_train = train[features_train].kurtosis(axis=1)
train["row_kurt"] =row_kurt_train
row_kurt_test = test[features_test].kurtosis(axis=1)
test["row_kurt"] = row_kurt_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_kurt")\
             .add_legend()
plt.title("Histogram of kurt")
plt.ylabel("Density of kurt")
plt.plot()
Out[0]:
[]

it is clear from the above pdf that when kurt>2 and kurt<4 probability of target==1 is high

sum

In [0]:
row_sum_train = train[features_train].sum(axis=1)
train["row_sum"] =row_sum_train
row_sum_test = test[features_test].sum(axis=1)
test["row_sum"] = row_sum_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "row_sum")\
             .add_legend()
plt.title("Histogram of sum")
plt.ylabel("Density of sum")
plt.plot()
Out[0]:
[]

it is clear from the above pdf that when sum>1250 and sum<1450 probability of target==1 is high

moving sum mean

In [0]:
#https://www.kaggle.com/hjd810/keras-lgbm-aug-feature-eng-sampling-prediction
row_ma_train = train[features_train].apply(lambda x: np.ma.average(x), axis=1)
train["ma"] = row_ma_train
row_ma_test = test[features_test].apply(lambda x: np.ma.average(x), axis=1)
test["ma"] = row_ma_test
In [0]:
#https://seaborn.pydata.org/generated/seaborn.distplot.html
#https://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.average.html
sns.FacetGrid(train, hue = "target", height = 5)\
             .map(sns.distplot, "ma")\
             .add_legend()
plt.title("Histogram of ma")
plt.ylabel("Density of ma")
plt.plot()
Out[0]:
[]

it is clear from the above pdf that when ma>6.2 and ma<7 probability of target==1 is high

Pdf and cdf using kde(Kernal Distribution Estimation)

In [0]:
t0 = train.loc[train['target'] == 0]
t1 = train.loc[train['target'] == 1]
In [0]:
#reference: aaic haberman
counts, bin_edges=np.histogram(t0["row_mean"], bins=10, density=True)
pdf=counts/(sum(counts))
print(pdf);    #this will return 10 values
print(bin_edges);  #this will return 11 values
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="mean0pdf");
plt.plot(bin_edges[1:], cdf, label="mean0pdf");

counts, bin_edges=np.histogram(t1["row_mean"], bins=10, density=True)
pdf=counts/(sum(counts))
print(pdf);    #this will return 10 values
print(bin_edges);  #this will return 11 values
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="mean1pdf");
plt.plot(bin_edges[1:], cdf, label="mean1pdf");
plt.legend()
plt.title("Pdf & Cdf of year")
plt.xlabel("Year")
plt.ylabel("percentage")
[1.50081711e-04 2.89046259e-03 2.75650076e-02 1.28225367e-01
 2.94043424e-01 3.22197641e-01 1.72721815e-01 4.59361208e-02
 5.89765539e-03 3.72424987e-04]
[4.9633175  5.31394435 5.6645712  6.01519805 6.3658249  6.71645175
 7.0670786  7.41770545 7.7683323  8.11895915 8.469586  ]
[0.00059707 0.00597074 0.04094935 0.13598368 0.26639467 0.2887352
 0.18260523 0.06398647 0.01293661 0.00184098]
[5.1600995 5.4754376 5.7907757 6.1061138 6.4214519 6.73679   7.0521281
 7.3674662 7.6828043 7.9981424 8.3134805]
Out[0]:
Text(0, 0.5, 'percentage')

from the above pdf and cdf we can say that 90 % of data lie below 7.5

Box_plot

In [0]:
#reference aaic haberman
sns.boxplot(x="target", y="row_sum", data=train)
plt.title("Boxplot for row_sum")
plt.plot()
Out[0]:
[]

distribution according to box plot is also same.

In [0]:
sns.boxplot(x="target", y="row_mean", data=train)
plt.title("Boxplot for mean")
plt.plot()
Out[0]:
[]

Visualize var_13 to var_17

In [0]:
#create a function which makes the plot:
#https://www.kaggle.com/sicongfang/eda-feature-engineering
from matplotlib.ticker import FormatStrFormatter
def visualize_numeric(ax1, ax2, ax3, df, col, target):
    #plot histogram:
    df.hist(column=col,ax=ax1,bins=200)
    ax1.set_xlabel('Histogram')
    
    #plot box-whiskers:
    df.boxplot(column=col,by=target,ax=ax2)
    ax2.set_xlabel('Transactions')
    
    #plot top 10 counts:
    cnt = df[col].value_counts().sort_values(ascending=False)
    cnt.head(10).plot(kind='barh',ax=ax3)
    ax3.invert_yaxis()  # labels read top-to-bottom
#     ax3.yaxis.set_major_formatter(FormatStrFormatter('%.2f')) #somehow not working 
    ax3.set_xlabel('Count')
In [0]:
##https://www.kaggle.com/sicongfang/eda-feature-engineering
for col in list(train.columns[10:20]):
    fig, axes = plt.subplots(1, 3,figsize=(10,3))
    ax11 = plt.subplot(1, 3, 1)
    ax21 = plt.subplot(1, 3, 2)
    ax31 = plt.subplot(1, 3, 3)
    fig.suptitle('Feature: %s'%col,fontsize=5)
    visualize_numeric(ax11,ax21,ax31,train,col,'target')
    plt.tight_layout()

->from the above we can conclude that data follows different distribution\ ->from the boxplot we can assume that for var_11 50% of its values lies with -8 to 0. and for for var_10 50% of its value lie within -5 to 5 and like wise for others we can conclude from boxplot\ ->from the above count plot we can see that maximum number of count of some particular value is variable in nature.

In [0]:
train.head(2)
Out[0]:
ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 ... var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199 row_mean row_median row_std row_min row_max row_skew row_kurt row_sum ma
0 train_0 0 8.9255 -6.7863 11.9081 5.093 11.4607 -9.2834 5.1187 18.6266 -4.9200 5.7470 2.9252 3.1821 14.0137 0.5745 8.7989 14.5691 5.7487 -7.2393 4.284 30.7133 10.5350 16.2191 2.5791 2.4716 14.3831 13.4325 -5.1488 -0.4073 4.9306 5.9965 -0.3085 12.9041 -3.8766 16.8911 11.1920 10.5785 0.6764 7.8871 ... 5.4879 -4.7645 -8.4254 20.8773 3.1531 18.5618 7.7423 -10.1245 13.7241 -3.5189 1.7202 -8.4051 9.0164 3.0657 14.3691 25.8398 5.8764 11.8411 -19.7159 17.5743 0.5857 4.4354 3.9642 3.1364 1.6910 18.5227 -2.3978 7.8784 8.5635 12.7803 -1.0914 7.281591 6.77040 9.33154 -21.4494 43.1127 0.101580 1.331023 1456.3182 7.281591
1 train_1 0 11.5006 -4.1473 13.8588 5.389 12.3622 7.0433 5.6208 16.5338 3.1468 8.0851 -0.4032 8.0585 14.0239 8.4135 5.4345 13.7003 13.8275 -15.5849 7.800 28.5708 3.4287 2.7407 8.5524 3.3716 6.9779 13.8910 -11.7684 -2.5586 5.0464 0.5481 -9.2987 7.8755 1.2859 19.3710 11.3702 0.7399 2.7995 5.8434 ... 5.7999 5.5378 5.0988 22.0330 5.5134 30.2645 10.4968 -7.2352 16.5721 -7.3477 11.0752 -5.5937 9.4878 -14.9100 9.4245 22.5441 -4.8622 7.6543 -15.9319 13.3175 -0.3566 7.6421 7.7214 2.5837 10.9516 15.4305 2.0339 8.1267 8.7889 18.3560 1.9518 7.076818 7.22315 10.33613 -47.3797 40.5632 -0.351734 4.110215 1415.3636 7.076818

2 rows × 211 columns

In [0]:
from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive

Now saving all the feature engineered data to train_santander.csv

In [0]:
train.to_csv("/content/drive/My Drive/train_santander.csv")
In [0]:
test.to_csv("/content/drive/My Drive/test_santander.csv")

importing necessary libraries

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics.classification import accuracy_score, log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from imblearn.over_sampling import SMOTE
from collections import Counter
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold 
from collections import Counter, defaultdict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import math
from sklearn.metrics import normalized_mutual_info_score
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings("ignore")

from mlxtend.classifier import StackingClassifier

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

working only on train datasets just to see how well my model is doing.

In [0]:
train = pd.read_csv("/content/drive/My Drive/train_santander.csv")
#test = pd.read_csv("/content/drive/My Drive/test_santander.csv")
In [0]:
train.head(2)
Out[0]:
Unnamed: 0 ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 ... var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199 row_mean row_median row_std row_min row_max row_skew row_kurt row_sum ma
0 0 train_0 0 8.9255 -6.7863 11.9081 5.093 11.4607 -9.2834 5.1187 18.6266 -4.9200 5.7470 2.9252 3.1821 14.0137 0.5745 8.7989 14.5691 5.7487 -7.2393 4.284 30.7133 10.5350 16.2191 2.5791 2.4716 14.3831 13.4325 -5.1488 -0.4073 4.9306 5.9965 -0.3085 12.9041 -3.8766 16.8911 11.1920 10.5785 0.6764 ... 5.4879 -4.7645 -8.4254 20.8773 3.1531 18.5618 7.7423 -10.1245 13.7241 -3.5189 1.7202 -8.4051 9.0164 3.0657 14.3691 25.8398 5.8764 11.8411 -19.7159 17.5743 0.5857 4.4354 3.9642 3.1364 1.6910 18.5227 -2.3978 7.8784 8.5635 12.7803 -1.0914 7.281591 6.77040 9.33154 -21.4494 43.1127 0.101580 1.331023 1456.3182 7.281591
1 1 train_1 0 11.5006 -4.1473 13.8588 5.389 12.3622 7.0433 5.6208 16.5338 3.1468 8.0851 -0.4032 8.0585 14.0239 8.4135 5.4345 13.7003 13.8275 -15.5849 7.800 28.5708 3.4287 2.7407 8.5524 3.3716 6.9779 13.8910 -11.7684 -2.5586 5.0464 0.5481 -9.2987 7.8755 1.2859 19.3710 11.3702 0.7399 2.7995 ... 5.7999 5.5378 5.0988 22.0330 5.5134 30.2645 10.4968 -7.2352 16.5721 -7.3477 11.0752 -5.5937 9.4878 -14.9100 9.4245 22.5441 -4.8622 7.6543 -15.9319 13.3175 -0.3566 7.6421 7.7214 2.5837 10.9516 15.4305 2.0339 8.1267 8.7889 18.3560 1.9518 7.076818 7.22315 10.33613 -47.3797 40.5632 -0.351734 4.110215 1415.3636 7.076818

2 rows × 212 columns

as we can see from above that I have successfully added feature engineered features in train data

In [0]:
#target values
target = train["target"].values
In [0]:
#imp features from column 3 to 212
train = train.iloc[:,3:212]
In [0]:
train.shape
Out[0]:
(200000, 209)

Dividing train into train and test

In [0]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split
train, test, y_train, y_test = train_test_split(train, target, test_size=0.4)
In [0]:
train.head(2)
Out[0]:
var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 var_38 var_39 ... var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199 row_mean row_median row_std row_min row_max row_skew row_kurt row_sum ma
41283 10.9731 -4.6904 12.0139 6.7342 10.7631 -7.8887 6.4342 14.6397 4.5487 6.4155 6.8424 -5.3654 14.3871 6.5517 4.2565 15.1091 12.0692 -5.5671 9.1865 11.3665 13.4977 23.2593 2.2732 3.3361 13.4062 13.9623 2.6071 -0.3359 7.0530 6.4352 5.5106 10.9938 -1.5205 12.4875 11.5376 4.9532 -0.5127 4.0366 14.8912 7.661 ... 5.6388 8.8027 4.6000 25.9067 8.3983 19.1504 11.3283 1.8560 9.9739 -5.7323 8.6185 4.1221 10.7098 2.0940 2.5167 27.4846 -5.5612 12.1537 -24.7277 14.5427 0.8251 1.0525 4.7981 4.4288 0.1983 18.7696 1.9181 -3.0739 7.1089 9.3686 -10.5074 6.661551 6.83830 8.880585 -25.078 35.8172 -0.595993 2.436583 1332.3101 6.661551
110029 9.8618 -7.5704 11.2805 7.8334 12.9967 6.8037 5.5669 17.9570 -3.5852 6.2679 -0.2374 -10.4112 14.0035 4.3036 8.2714 14.3894 15.9249 -12.9864 28.8711 0.8100 10.2118 17.6094 1.8527 3.3199 16.2854 13.8090 -1.9574 -0.2743 5.6834 5.6344 2.5132 6.9606 4.6371 9.4640 10.5797 3.1576 0.1909 3.3493 10.2330 2.381 ... 5.5711 -4.1805 1.0546 27.6319 -0.7639 15.1478 11.0255 -6.2949 13.2066 2.0838 4.3134 -15.9505 12.1111 11.0177 14.4541 12.8081 -7.9056 11.0321 -22.3733 11.3341 -0.8604 -1.1307 8.4142 1.6420 5.3260 19.1456 -0.2803 8.3954 10.1767 14.1339 2.1590 6.528441 6.89745 10.355810 -34.039 45.0858 -0.492301 3.390167 1305.6882 6.528441

2 rows × 209 columns

In [0]:
test.head(2)
Out[0]:
var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 var_38 var_39 ... var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199 row_mean row_median row_std row_min row_max row_skew row_kurt row_sum ma
52412 6.3458 -5.5601 12.5416 5.1773 11.0935 -7.9247 7.5710 13.7295 -4.5946 6.7022 2.8411 -8.3097 14.4216 5.7221 6.0331 14.8097 8.6337 -7.5709 23.6559 28.672 16.9421 14.4065 1.7431 2.6967 3.7811 13.8501 -0.5899 -1.6085 6.6963 4.7548 6.3401 10.6486 -4.7741 24.2817 10.1831 -8.7066 3.1784 5.5028 18.4083 0.8747 ... 6.0622 -7.2482 4.6173 19.2730 -10.7901 28.1440 12.2735 -1.6377 11.8265 1.6917 2.8912 3.2364 10.4369 -5.4914 5.4465 -2.1703 2.4359 11.3872 -7.4573 15.7652 0.2683 5.5822 6.0076 2.5439 2.3071 23.8959 -0.7271 2.4738 9.5031 10.5450 -9.0234 7.045450 6.34295 10.171849 -23.4313 48.2625 0.379193 1.783650 1409.0901 7.045450
10937 10.3105 -5.4767 11.2129 3.9750 11.2268 2.2041 4.6284 11.6618 1.9519 7.8688 -2.9378 1.9969 13.9937 0.5358 5.8572 14.3016 9.0693 -0.2681 11.9068 8.129 12.0534 11.1221 2.5933 3.8284 10.1267 13.3848 -6.8049 -2.9323 6.3521 4.6972 -13.2172 11.4059 -2.4931 19.6337 11.5062 7.8407 2.8028 3.1814 4.1864 -6.1170 ... 5.7612 -2.7061 0.1949 11.1605 -8.4401 25.0392 11.8677 12.2229 15.5563 20.5926 -2.7311 -7.2906 9.0223 5.8180 17.7205 1.8794 -1.0380 9.7449 -17.5916 23.4232 1.1604 8.4095 8.7317 2.3419 1.6971 19.3005 0.9804 -6.3413 8.0463 17.0228 9.5817 6.696716 6.69065 10.068970 -44.3780 50.2550 -0.338030 4.724142 1339.3433 6.696716

2 rows × 209 columns

In [0]:
#https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaler = scaler.fit(train) 
train = scaler.transform(train)
test = scaler.transform(test)

Now applying different ML algorithms

Logistic

Defining necessary functions.

In [0]:
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs

    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0]%1000
    # consider you X_tr shape is 49041, then your cr_loop will be 49041 - 49041%1000 = 49000
    # in this for loop we will iterate unti the last 1000 multiplier
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:,1])
    # we will be predicting for the last data points
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:,1])
    
    return y_data_pred
In [0]:
#From facebook recommendation applied this code is taken and modified according to use
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    
    TN = C[0,0]       
    FP = C[0,1]  
    FN = C[1,0]
    TP = C[1,1]
    print("True Positive",TP)
    print("False Negative",FN)
    print("False Positive",FP)
    print("True Negative",TN)
    
    
    
    A =(((C.T)/(C.sum(axis=1))).T)
    
    B =(C/C.sum(axis=0))
    plt.figure(figsize=(30,6))
    
    labels = [0,1]
    # representing A in heatmap format
    cmap=sns.light_palette("Navy", as_cmap=True)#https://stackoverflow.com/questions/37902459/seaborn-color-palette-as-matplotlib-colormap
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")
    
    
    plt.show()
In [0]:
# we are writing our own function for predict, with defined thresould
# we will pick a threshold that will give the least fpr
def predict(proba, threshould, fpr, tpr):
    
    t = threshould[np.argmax(tpr*(1-fpr))]
    
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    predictions = []
    for i in proba:
        if i>=t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
In [0]:
#As mentioned in logistic regression assignment I am changing alpha to log to plot a goog graph
import numpy as np
def log_alpha(al):
    alpha=[]
    for i in al:
        a=np.log(i)
        alpha.append(a)
    return alpha    

Logistic Regression

In [0]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

lg = SGDClassifier(loss='log', class_weight='balanced', penalty="l2")
alpha=[0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]
parameters = {'alpha':alpha}
clf = GridSearchCV(lg, parameters, cv=3, scoring='roc_auc', n_jobs=-1, return_train_score=True,)
clf.fit(train, y_train)

print("Model with best parameters :\n",clf.best_estimator_)

alpha = log_alpha(alpha)


best_alpha = clf.best_estimator_.alpha
#best_split = clf.best_estimator_.min_samples_split

print(best_alpha)
#print(best_split)

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

plt.plot(alpha, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(alpha,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

plt.plot(alpha, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(alpha,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points')
plt.scatter(alpha, cv_auc, label='CV AUC points')


plt.legend()
plt.xlabel("alpha and l1")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
Model with best parameters :
 SGDClassifier(alpha=0.0001, average=False, class_weight='balanced',
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
0.0001

From the above plot it is clearly visible that when alpha=0.0001 we have maximum auc.

Making final models with best alpha and penalty

In [0]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
from sklearn.calibration import CalibratedClassifierCV

lg = SGDClassifier(loss='log', alpha=best_alpha, penalty="l2", class_weight="balanced")
#lg.fit(train_1, project_data_y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

sig_clf = CalibratedClassifierCV(lg, method="isotonic")
lg = sig_clf.fit(train, y_train)


y_train_pred = lg.predict_proba(train)[:,1]   
y_test_pred = lg.predict_proba(test)[:,1] 

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel(" hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()

So the maximum auc here is 0.865

Confusion Matrix with using map

In [0]:
print('Train confusion_matrix')
plot_confusion_matrix(y_train,predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
Train confusion_matrix
the maximum value of tpr*(1-fpr) 0.24999986924548293 for threshold 0.033
True Positive 11404
False Negative 742
False Positive 53966
True Negative 53888
In [0]:
print('Test confusion_matrix')
plot_confusion_matrix(y_test,predict(y_test_pred, tr_thresholds, train_fpr, train_fpr))
Test confusion_matrix
the maximum value of tpr*(1-fpr) 0.24999986924548293 for threshold 0.033
True Positive 7438
False Negative 514
False Positive 35910
True Negative 36138

SVM

In [0]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

svm = SGDClassifier(loss='hinge', class_weight='balanced', penalty="l2")
alpha=[0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]
parameters = {'alpha':alpha}
clf = GridSearchCV(svm, parameters, cv=3, scoring='roc_auc', n_jobs=-1, return_train_score=True,)
clf.fit(train, y_train)

print("Model with best parameters :\n",clf.best_estimator_)

alpha = log_alpha(alpha)


best_alpha = clf.best_estimator_.alpha
#best_split = clf.best_estimator_.min_samples_split

print(best_alpha)
#print(best_split)

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

plt.plot(alpha, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(alpha,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

plt.plot(alpha, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(alpha,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points')
plt.scatter(alpha, cv_auc, label='CV AUC points')


plt.legend()
plt.xlabel("alpha and l1")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
Model with best parameters :
 SGDClassifier(alpha=0.0001, average=False, class_weight='balanced',
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
0.0001

from the above auc plot it is clearly visible that when auc alpha = 0.0001 , we have maximum auc

Making final models with best alpha and penalty

In [0]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
from sklearn.calibration import CalibratedClassifierCV

svm = SGDClassifier(loss='hinge', alpha=best_alpha, penalty="l2", class_weight="balanced")
#svm.fit(train_1, project_data_y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

sig_clf = CalibratedClassifierCV(svm, method="isotonic")
svm = sig_clf.fit(train, y_train)


y_train_pred = svm.predict_proba(train)[:,1]   
y_test_pred = svm.predict_proba(test)[:,1] 

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel(" hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()

so from the above plot we can see tha test auc is 0.865

Confusion Matrix with using map

In [0]:
print('Train confusion_matrix')
plot_confusion_matrix(y_train,predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
Train confusion_matrix
the maximum value of tpr*(1-fpr) 0.24999788102032108 for threshold 0.031
True Positive 11389
False Negative 757
False Positive 53770
True Negative 54084
In [0]:
print('Test confusion_matrix')
plot_confusion_matrix(y_test,predict(y_test_pred, tr_thresholds, train_fpr, train_fpr))
Test confusion_matrix
the maximum value of tpr*(1-fpr) 0.24999788102032108 for threshold 0.031
True Positive 7442
False Negative 510
False Positive 35846
True Negative 36202

Naive Bayes

In [0]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
#https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
naive = MultinomialNB(fit_prior=False)
alpha=[0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]
parameters = {'alpha':alpha}
clf = GridSearchCV(naive, parameters, cv=3, scoring='roc_auc', return_train_score=True)
clf.fit(train, y_train)

print("Model with best parameters :\n",clf.best_estimator_)

train_auc= list(clf.cv_results_['mean_train_score'])
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = list(clf.cv_results_['mean_test_score']) 
cv_auc_std= clf.cv_results_['std_test_score']

best_alpha=clf.best_estimator_.alpha

alpha = log_alpha(alpha)

plt.plot(alpha, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(alpha,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

plt.plot(alpha, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(alpha,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points')
plt.scatter(alpha, cv_auc, label='CV AUC points')


plt.legend()
plt.xlabel("alpha and l1")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
Model with best parameters :
 MultinomialNB(alpha=1e-05, class_prior=None, fit_prior=False)
In [0]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc

naive = MultinomialNB(alpha=best_alpha, fit_prior=False)
naive.fit(train, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = naive.predict_proba(train)[:,1]    
y_test_pred = naive.predict_proba(test)[:,1]

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel(" hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()

from the above plot we can say that test auc is 0.855

Confusion Matrix using heat map

In [0]:
print('Train confusion_matrix')
plot_confusion_matrix(y_train,predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
Train confusion_matrix
the maximum value of tpr*(1-fpr) 0.2499999854717203 for threshold 0.486
True Positive 11266
False Negative 880
False Positive 53914
True Negative 53940
In [0]:
print('Test confusion_matrix')
plot_confusion_matrix(y_test,predict(y_test_pred, tr_thresholds, train_fpr, train_fpr))
Test confusion_matrix
the maximum value of tpr*(1-fpr) 0.2499999854717203 for threshold 0.486
True Positive 7376
False Negative 576
False Positive 35751
True Negative 36297
In [0]:
train = pd.read_csv("/content/drive/My Drive/train_santander.csv")
In [0]:
from sklearn.model_selection import train_test_split
train, test, y_train, y_test = train_test_split(train, target, test_size=0.4)
In [0]:
#taking all the columns except idcode, target, unnamed0
features = [c for c in train.columns if c not in ['ID_code', 'target',"Unnamed:0"]]
target = train['target']

importing necessary libraries

In [0]:
import gc
import os
import logging
import datetime
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
warnings.filterwarnings('ignore')
In [0]:
#setting parameters
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average':'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,  
    'metric':'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary', 
    'verbosity': 1
}
In [0]:
#making 10 folds
folds = StratifiedKFold(n_splits=10, shuffle=False, random_state=44000)
oof = np.zeros(len(train))
predictions = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 3000)
    oof[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
Fold 0
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.91012	valid_1's auc: 0.891048
[2000]	training's auc: 0.922393	valid_1's auc: 0.898032
[3000]	training's auc: 0.93073	valid_1's auc: 0.90145
[4000]	training's auc: 0.93779	valid_1's auc: 0.902904
[5000]	training's auc: 0.943841	valid_1's auc: 0.903862
[6000]	training's auc: 0.949329	valid_1's auc: 0.904328
[7000]	training's auc: 0.954276	valid_1's auc: 0.904513
[8000]	training's auc: 0.95882	valid_1's auc: 0.904236
[9000]	training's auc: 0.963171	valid_1's auc: 0.903923
Early stopping, best iteration is:
[6543]	training's auc: 0.952048	valid_1's auc: 0.904637
Fold 1
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910776	valid_1's auc: 0.87999
[2000]	training's auc: 0.923381	valid_1's auc: 0.887051
[3000]	training's auc: 0.931882	valid_1's auc: 0.889855
[4000]	training's auc: 0.938715	valid_1's auc: 0.891421
[5000]	training's auc: 0.944728	valid_1's auc: 0.891751
[6000]	training's auc: 0.950076	valid_1's auc: 0.892179
[7000]	training's auc: 0.954983	valid_1's auc: 0.8922
[8000]	training's auc: 0.959543	valid_1's auc: 0.891963
[9000]	training's auc: 0.963822	valid_1's auc: 0.891783
Early stopping, best iteration is:
[6800]	training's auc: 0.954026	valid_1's auc: 0.892395
Fold 2
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910917	valid_1's auc: 0.880874
[2000]	training's auc: 0.92327	valid_1's auc: 0.888438
[3000]	training's auc: 0.931566	valid_1's auc: 0.892268
[4000]	training's auc: 0.938475	valid_1's auc: 0.893897
[5000]	training's auc: 0.944514	valid_1's auc: 0.895186
[6000]	training's auc: 0.949925	valid_1's auc: 0.895621
[7000]	training's auc: 0.954805	valid_1's auc: 0.896052
[8000]	training's auc: 0.95937	valid_1's auc: 0.896057
[9000]	training's auc: 0.963674	valid_1's auc: 0.895937
[10000]	training's auc: 0.967612	valid_1's auc: 0.895907
Early stopping, best iteration is:
[7218]	training's auc: 0.955811	valid_1's auc: 0.896161
Fold 3
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910332	valid_1's auc: 0.886326
[2000]	training's auc: 0.922851	valid_1's auc: 0.893769
[3000]	training's auc: 0.931158	valid_1's auc: 0.896971
[4000]	training's auc: 0.938053	valid_1's auc: 0.898329
[5000]	training's auc: 0.944101	valid_1's auc: 0.899213
[6000]	training's auc: 0.949485	valid_1's auc: 0.899612
[7000]	training's auc: 0.954551	valid_1's auc: 0.899746
[8000]	training's auc: 0.959176	valid_1's auc: 0.899663
[9000]	training's auc: 0.963421	valid_1's auc: 0.899466
Early stopping, best iteration is:
[6659]	training's auc: 0.952844	valid_1's auc: 0.899818
Fold 4
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910884	valid_1's auc: 0.876176
[2000]	training's auc: 0.923198	valid_1's auc: 0.885391
[3000]	training's auc: 0.931728	valid_1's auc: 0.889177
[4000]	training's auc: 0.938842	valid_1's auc: 0.891406
[5000]	training's auc: 0.944914	valid_1's auc: 0.891953
[6000]	training's auc: 0.950405	valid_1's auc: 0.892415
[7000]	training's auc: 0.95528	valid_1's auc: 0.892463
[8000]	training's auc: 0.959821	valid_1's auc: 0.892343
[9000]	training's auc: 0.964021	valid_1's auc: 0.892289
Early stopping, best iteration is:
[6249]	training's auc: 0.951663	valid_1's auc: 0.89256
Fold 5
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910378	valid_1's auc: 0.885394
[2000]	training's auc: 0.922775	valid_1's auc: 0.893774
[3000]	training's auc: 0.931203	valid_1's auc: 0.896869
[4000]	training's auc: 0.938301	valid_1's auc: 0.898908
[5000]	training's auc: 0.94434	valid_1's auc: 0.900033
[6000]	training's auc: 0.949729	valid_1's auc: 0.900666
[7000]	training's auc: 0.954701	valid_1's auc: 0.900446
[8000]	training's auc: 0.959163	valid_1's auc: 0.900884
[9000]	training's auc: 0.963404	valid_1's auc: 0.900944
[10000]	training's auc: 0.967309	valid_1's auc: 0.900969
[11000]	training's auc: 0.970908	valid_1's auc: 0.900706
[12000]	training's auc: 0.974286	valid_1's auc: 0.900744
Early stopping, best iteration is:
[9622]	training's auc: 0.965848	valid_1's auc: 0.901077
Fold 6
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910555	valid_1's auc: 0.884567
[2000]	training's auc: 0.922579	valid_1's auc: 0.892911
[3000]	training's auc: 0.930862	valid_1's auc: 0.896466
[4000]	training's auc: 0.93785	valid_1's auc: 0.898559
[5000]	training's auc: 0.943937	valid_1's auc: 0.899583
[6000]	training's auc: 0.94941	valid_1's auc: 0.899824
[7000]	training's auc: 0.954497	valid_1's auc: 0.900188
[8000]	training's auc: 0.95906	valid_1's auc: 0.900387
[9000]	training's auc: 0.963334	valid_1's auc: 0.900317
[10000]	training's auc: 0.967349	valid_1's auc: 0.900313
[11000]	training's auc: 0.971038	valid_1's auc: 0.900474
Early stopping, best iteration is:
[8219]	training's auc: 0.960043	valid_1's auc: 0.900503
Fold 7
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910169	valid_1's auc: 0.888964
[2000]	training's auc: 0.922202	valid_1's auc: 0.894795
[3000]	training's auc: 0.93062	valid_1's auc: 0.898717
[4000]	training's auc: 0.937701	valid_1's auc: 0.901172
[5000]	training's auc: 0.94373	valid_1's auc: 0.90236
[6000]	training's auc: 0.949189	valid_1's auc: 0.903216
[7000]	training's auc: 0.954234	valid_1's auc: 0.903715
[8000]	training's auc: 0.958932	valid_1's auc: 0.903623
[9000]	training's auc: 0.963206	valid_1's auc: 0.903486
[10000]	training's auc: 0.96721	valid_1's auc: 0.903111
Early stopping, best iteration is:
[7110]	training's auc: 0.954745	valid_1's auc: 0.903751
Fold 8
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910458	valid_1's auc: 0.884755
[2000]	training's auc: 0.922732	valid_1's auc: 0.892515
[3000]	training's auc: 0.931118	valid_1's auc: 0.895593
[4000]	training's auc: 0.938013	valid_1's auc: 0.898047
[5000]	training's auc: 0.944101	valid_1's auc: 0.899248
[6000]	training's auc: 0.949529	valid_1's auc: 0.899813
[7000]	training's auc: 0.954588	valid_1's auc: 0.900217
[8000]	training's auc: 0.95922	valid_1's auc: 0.900048
[9000]	training's auc: 0.963411	valid_1's auc: 0.900011
Early stopping, best iteration is:
[6848]	training's auc: 0.953842	valid_1's auc: 0.900337
Fold 9
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.910958	valid_1's auc: 0.880668
[2000]	training's auc: 0.923063	valid_1's auc: 0.887606
[3000]	training's auc: 0.931616	valid_1's auc: 0.890702
[4000]	training's auc: 0.938643	valid_1's auc: 0.892194
[5000]	training's auc: 0.944671	valid_1's auc: 0.89263
[6000]	training's auc: 0.950148	valid_1's auc: 0.892885
[7000]	training's auc: 0.955041	valid_1's auc: 0.892787
[8000]	training's auc: 0.959587	valid_1's auc: 0.893121
[9000]	training's auc: 0.963853	valid_1's auc: 0.892884
[10000]	training's auc: 0.967755	valid_1's auc: 0.893041
[11000]	training's auc: 0.971413	valid_1's auc: 0.89268
Early stopping, best iteration is:
[8175]	training's auc: 0.960348	valid_1's auc: 0.893211
CV score: 0.89830 

from the above we can say taht lightgbm has performed well than all the other models. auc reaching to 0.90

Important features in decending order.

In [0]:
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,28))
#plotting a bar plot where y represents features and x represents its importance.
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
plt.savefig('FI.png')

PrettyTable

In [0]:
#http://zetcode.com/python/prettytable/
from prettytable import PrettyTable

x = PrettyTable()
x.field_names =["Models","Test auc"]
x.add_row(["Logistic ",0.865])
x.add_row(["SVM ",0.865])
x.add_row(["Naive ",0.85])
x.add_row(["LightGbm",0.90])

print(x)
+-----------+----------+
|   Models  | Test auc |
+-----------+----------+
| Logistic  |  0.865   |
|    SVM    |  0.865   |
|   Naive   |   0.85   |
|  LightGbm |   0.9    |
+-----------+----------+

Final model with full dataset with submission.

importing dataset from drive.

In [0]:
train = pd.read_csv("/content/drive/My Drive/train_santander.csv")
test = pd.read_csv("/content/drive/My Drive/test_santander.csv")
In [0]:
train.head(2)
Out[0]:
Unnamed: 0 ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 ... var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199 row_mean row_median row_std row_min row_max row_skew row_kurt row_sum ma
0 0 train_0 0 8.9255 -6.7863 11.9081 5.093 11.4607 -9.2834 5.1187 18.6266 -4.9200 5.7470 2.9252 3.1821 14.0137 0.5745 8.7989 14.5691 5.7487 -7.2393 4.284 30.7133 10.5350 16.2191 2.5791 2.4716 14.3831 13.4325 -5.1488 -0.4073 4.9306 5.9965 -0.3085 12.9041 -3.8766 16.8911 11.1920 10.5785 0.6764 ... 5.4879 -4.7645 -8.4254 20.8773 3.1531 18.5618 7.7423 -10.1245 13.7241 -3.5189 1.7202 -8.4051 9.0164 3.0657 14.3691 25.8398 5.8764 11.8411 -19.7159 17.5743 0.5857 4.4354 3.9642 3.1364 1.6910 18.5227 -2.3978 7.8784 8.5635 12.7803 -1.0914 7.281591 6.77040 9.33154 -21.4494 43.1127 0.101580 1.331023 1456.3182 7.281591
1 1 train_1 0 11.5006 -4.1473 13.8588 5.389 12.3622 7.0433 5.6208 16.5338 3.1468 8.0851 -0.4032 8.0585 14.0239 8.4135 5.4345 13.7003 13.8275 -15.5849 7.800 28.5708 3.4287 2.7407 8.5524 3.3716 6.9779 13.8910 -11.7684 -2.5586 5.0464 0.5481 -9.2987 7.8755 1.2859 19.3710 11.3702 0.7399 2.7995 ... 5.7999 5.5378 5.0988 22.0330 5.5134 30.2645 10.4968 -7.2352 16.5721 -7.3477 11.0752 -5.5937 9.4878 -14.9100 9.4245 22.5441 -4.8622 7.6543 -15.9319 13.3175 -0.3566 7.6421 7.7214 2.5837 10.9516 15.4305 2.0339 8.1267 8.7889 18.3560 1.9518 7.076818 7.22315 10.33613 -47.3797 40.5632 -0.351734 4.110215 1415.3636 7.076818

2 rows × 212 columns

In [0]:
test.head(2)
Out[0]:
Unnamed: 0 ID_code var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12 var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 var_22 var_23 var_24 var_25 var_26 var_27 var_28 var_29 var_30 var_31 var_32 var_33 var_34 var_35 var_36 var_37 ... var_169 var_170 var_171 var_172 var_173 var_174 var_175 var_176 var_177 var_178 var_179 var_180 var_181 var_182 var_183 var_184 var_185 var_186 var_187 var_188 var_189 var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199 row_mean row_median row_std row_min row_max row_skew row_kurt row_sum ma
0 0 test_0 11.0656 7.7798 12.9536 9.4292 11.4327 -2.3805 5.8493 18.2675 2.1337 8.8100 -2.0248 -4.3554 13.9696 0.3458 7.5408 14.5001 7.7028 -19.0919 15.5806 16.1763 3.7088 18.8064 1.5899 3.0654 6.4509 14.1192 -9.4902 -2.1917 5.7107 3.7864 -1.7981 9.2645 2.0657 12.7753 11.3334 8.1462 -0.0610 3.5331 ... 5.1855 4.2603 1.6779 29.0849 8.4685 18.1317 12.2818 -0.6912 10.2226 -5.5579 2.2926 -4.5358 10.3903 -15.4937 3.9697 31.3521 -1.1651 9.2874 -23.5705 13.2643 1.6591 -2.1556 11.8495 -1.4300 2.4508 13.7112 2.4669 4.3654 10.7200 15.4722 -8.7197 7.083202 7.3144 9.910632 -31.9891 42.0248 -0.088518 1.871262 1416.6404 7.083202
1 1 test_1 8.5304 1.2543 11.3047 5.1858 9.1974 -4.0117 6.0196 18.6316 -4.4131 5.9739 -1.3809 -0.3310 14.1129 2.5667 5.4988 14.1853 7.0196 4.6564 29.1609 0.0910 12.1469 3.1389 5.2578 2.4228 16.2064 13.5023 -5.2341 -3.6648 5.7080 2.9965 -10.4720 11.4938 -0.9660 15.3445 10.6361 0.8966 6.7428 2.3421 ... 5.3924 -0.7720 -8.1783 29.9227 -5.6274 10.5018 9.6083 -0.4935 8.1696 -4.3605 5.2110 0.4087 12.0030 -10.3812 5.8496 25.1958 -8.8468 11.8263 -8.7112 15.9072 0.9812 10.6165 8.8349 0.9403 10.1282 15.5765 0.4773 -1.4852 9.8714 19.1293 -20.9760 6.248430 6.4396 9.541267 -41.1924 35.6020 -0.559785 3.391068 1249.6860 6.248430

2 rows × 211 columns

Credit: https://www.kaggle.com/titericz/single-model-using-only-train-counts-information

I thought of trying magic as done in other kernels but this kernal was much more interesting\ because in this values_count() is used in a simple and more readable way.\ leaderboard score after implementing this.\


Public: 0.91497\ Private: 0.91433


taking only those features whose name starts with var, like var_0, var_1 etc.

In [0]:
features = [x for x in train.columns if x.startswith("var")]
In [0]:
#https://www.kaggle.com/titericz/single-model-using-only-train-counts-information

np.corrcoef return Pearson product-moment correlation coefficients and if it is less than zero then reverse the value of train and test.

In [0]:
#Reverse some features.
#Not really necessary for LGB, but helps a little
#https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html
for var in features:
    if np.corrcoef( train['target'], train[var] )[1][0] < 0:
        train[var] = train[var] * -1
        test[var]  = test[var]  * -1

value_counts() counts the number of time a variable has occured\ once we got the value count then we put that into a dictionary with\ keys as variabe and values as number of count.

In [0]:
#count train values to split Rare/NonRare values
var_stats = {}
for var in features:
    var_stats[var] = train[var].value_counts()

printing the top 10 counts of var_0

In [0]:
var_stats["var_0"].head(10)
Out[0]:
13.0656    11
8.6649     11
10.6829    11
11.9590    10
8.9425     10
10.7369    10
12.9271    10
9.5114     10
10.9468    10
8.9129     10
Name: var_0, dtype: int64

Defining the functions

1. logit() which gives the difference of log(p) and log(1-p)\ 2. var_to_features which creates a new dataframe\ and of shape(200000,4) and coulmn name as var, hist, feature it and var_rank

In [0]:
def logit(p):
    return np.log(p) - np.log(1 - p)

def var_to_feat(vr, var_stats, feat_id ):
    new_df = pd.DataFrame()
    new_df["var"] = vr.values
    new_df["hist"] = pd.Series(vr).map(var_stats)
    new_df["feature_id"] = feat_id
    new_df["var_rank"] = new_df["var"].rank()/200000.
    #print(new_df.shape)
    return new_df.values

creating a target of shape (40000000,) i.e 4 crores

In [0]:
TARGET = np.array( list(train['target'].values) * 200 )
TARGET.shape
Out[0]:
(40000000,)
In [0]:
TARGET
Out[0]:
array([0, 0, 0, ..., 0, 0, 0])

Train

here the idea is appending each 200000 rors 200 times that makes is 4 crore rows and predicting for each individually and then reshaping back to (200000, 200)

In [0]:
#initializing a empty list named TRAIN 
TRAIN = []
#initializing an empty dictionary var_mean
#this will contain mean value of counts
var_mean = {}
#this will contain variance of counts
#initializing an empty dictionary var_var
var_var  = {}
for var in features:
    #for all column in features
    #this below tmp wii be of shape (200000,4)
    tmp = var_to_feat(train[var], var_stats[var], int(var[4:]) )
    #putting mean of var with column as keys and mean with values.
    var_mean[var] = np.mean(tmp[:,0]) 
    #putting variance of var with column as keys and mean with values.
    var_var[var]  = np.var(tmp[:,0])
    #normalizing all tmp and putting that in tmp again
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]
    #appending everything in train
    TRAIN.append( tmp )
#this will stack in vertically    
TRAIN = np.vstack( TRAIN )
#taking target values
target = train['target'].values
#deleting train
del train
#garbaze collector deallocates the space
#https://www.geeksforgeeks.org/garbage-collection-python/
_=gc.collect()

print( TRAIN.shape, len( TARGET ) )
(40000000, 4) 40000000

we can see there are 4 column in train

In [0]:
TRAIN[0,:]
Out[0]:
array([-0.1898334,  6.       ,  0.       ,  0.3043175])

Light gbm

choosing the best parameters

In [0]:
#https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc
# https://www.kaggle.com/titericz/single-model-using-only-train-counts-information
model = lgb.LGBMClassifier(**{
     'learning_rate': 0.03,
     'num_leaves': 31,
     'max_bin': 1023,
     'min_child_samples': 1000,
     'feature_fraction': 1.0,
     'bagging_freq': 1,
     'bagging_fraction': 0.85,
     'objective': 'binary',
     'n_jobs': -1,
     'n_estimators':200,})

Training the model

In [0]:
#taking a total of 10 folds
NFOLDS = 10
predtrain = np.zeros( len(TARGET) )
MODELS = []
skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=11111)
for fold_, (train_indexes, valid_indexes) in enumerate(skf.split(TRAIN, TARGET)):
    print('Fold:', fold_ )
    model = model.fit( TRAIN[train_indexes], TARGET[train_indexes],
                      eval_set = (TRAIN[valid_indexes], TARGET[valid_indexes]),
                      verbose = 100,
                      eval_metric='auc',
                      early_stopping_rounds=20,
                      categorical_feature = [2] )
    MODELS.append( model )
    predtrain[valid_indexes] = model.predict_proba( TRAIN[valid_indexes] )[:,1] 

#Reshape to original format 200k x 200
pred = np.reshape( predtrain , (200000,200) , order='F' )
#Use logit for better performance
print( NFOLDS,'-Fold CV AUC:',roc_auc_score( target, np.mean( logit(pred),axis=1)  ) )
_=gc.collect()
Fold: 0
Training until validation scores don't improve for 20 rounds.
Early stopping, best iteration is:
[72]	valid_0's auc: 0.528425	valid_0's binary_logloss: 0.325237
Fold: 1
Training until validation scores don't improve for 20 rounds.
[100]	valid_0's auc: 0.528451	valid_0's binary_logloss: 0.325221
Early stopping, best iteration is:
[138]	valid_0's auc: 0.528489	valid_0's binary_logloss: 0.325215
Fold: 2
Training until validation scores don't improve for 20 rounds.
[100]	valid_0's auc: 0.529245	valid_0's binary_logloss: 0.325213
Early stopping, best iteration is:
[96]	valid_0's auc: 0.529258	valid_0's binary_logloss: 0.325214
Fold: 3
Training until validation scores don't improve for 20 rounds.
[100]	valid_0's auc: 0.527479	valid_0's binary_logloss: 0.325254
Early stopping, best iteration is:
[107]	valid_0's auc: 0.527492	valid_0's binary_logloss: 0.325252
Fold: 4
Training until validation scores don't improve for 20 rounds.
[100]	valid_0's auc: 0.528375	valid_0's binary_logloss: 0.325214
Early stopping, best iteration is:
[101]	valid_0's auc: 0.528383	valid_0's binary_logloss: 0.325214
Fold: 5
Training until validation scores don't improve for 20 rounds.
[100]	valid_0's auc: 0.527999	valid_0's binary_logloss: 0.325232
Early stopping, best iteration is:
[140]	valid_0's auc: 0.528042	valid_0's binary_logloss: 0.325226
Fold: 6
Training until validation scores don't improve for 20 rounds.
Early stopping, best iteration is:
[78]	valid_0's auc: 0.528153	valid_0's binary_logloss: 0.325266
Fold: 7
Training until validation scores don't improve for 20 rounds.
[100]	valid_0's auc: 0.528773	valid_0's binary_logloss: 0.325249
Early stopping, best iteration is:
[112]	valid_0's auc: 0.528792	valid_0's binary_logloss: 0.325246
Fold: 8
Training until validation scores don't improve for 20 rounds.
Early stopping, best iteration is:
[68]	valid_0's auc: 0.527847	valid_0's binary_logloss: 0.325268
Fold: 9
Training until validation scores don't improve for 20 rounds.
Early stopping, best iteration is:
[78]	valid_0's auc: 0.528205	valid_0's binary_logloss: 0.325253
10 -Fold CV AUC: 0.9167636465611065

as we can see that maximum auc we can attain with 10 folds is 0.9167

Test prediction and submission.

In [0]:
#initialising arrays of 2 lack rows and 200 columns with zero values
ypred = np.zeros( (200000,200) )
for feat,var in enumerate(features):
    #build dataset
    tmp = var_to_feat(test[var], var_stats[var], int(var[4:]) )
    #Standard Scale feature according train statistics
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]
    tmp[:,1] = tmp[:,1] + 1
    #Write 1 to frequency of values not seem in trainset
    tmp[ np.isnan(tmp) ] = 1
    #Predict testset for N folds
    for model_id in range(NFOLDS):
        model = MODELS[model_id]
        ypred[:,feat] += model.predict_proba( tmp )[:,1] / NFOLDS
#making final prediction with mean of logit.
ypred = np.mean( logit(ypred), axis=1 )

#Submission and finally taking its rank and normalizing it.
sub = test[['ID_code']]
sub['target'] = ypred
sub['target'] = sub['target'].rank() / 200000.
sub.to_csv('santander_good.csv', index=False)
print( sub.head(4) )
  ID_code    target
0  test_0  0.873200
1  test_1  0.915495
2  test_2  0.876500
3  test_3  0.875915

Conclusion:

Lightgbm is giving best results than any other models.\ final auc on test is greater than 0.91

Steps Done:

  1. Importing the necessary libraries.
  2. Visualizing the train and test data.
  3. Checking for null values in train and test data if any.
  4. Describing the data 5.Since pairplot for all the data was not possible so I did it for random 10 data
  5. Analysis of train data where we find out that data is purely unbaanced.
  6. Visualizing the pair plots.
  7. Pdf for all the features from 2 to 202(here we find out that there is some corelations between some of the data.)
  8. Visualising by tsne.
  9. Visualizing mean.
  10. visualising median
  11. visualising dtd
  12. visualising min
  13. visualising max
  14. visualising kurtosis
  15. visualizing skew
  16. visualising moving average.
  17. Visualizing by kde
  18. Visualizing by boxplot
  19. puting all the features to to dataframe.
  20. importing necessary libraries
  21. importing the new train data.
  22. Splitting data into train and test.
  23. Applying different models like naive bayes, logistic regression, svm, lightgbm
  24. Feature importance